Against the multi-core paradigm

Computers have come a long way: back in the '90s, clock rates were one to two orders of magnitude lower, and many CPUs did not yet overlap the execution of multiple instructions (pipelining). In the early 2000s, we experienced the clock rate race of CPU vendors, which peaked at roughly 4 GHz due to rising thermal dissipation. Our manufacturing processes are many times smaller than back then, shrinking our circuitry and in turn allowing for more complex CPU designs. However, because clock rates no longer went up significantly, vendors started shipping designs with multiple CPU cores or hyper-threaded virtual cores to get more performance out of a chip.

Now, I cannot condone traditional multi-core CPUs, because they mostly enhance performance in a multi-process setting, where many independent programs run faster at once than they would on a single-core system. The other case they help with is parallel computation on homogeneous data, albeit not as well as SIMD/superscalar computing does. However, running many tasks at once is not the bottleneck of computers (besides mainframes, maybe, but certainly not personal computers). The bottleneck today is that we want low latency for complex workloads, such as compiling the changes we just made to a program so we can iterate quickly. Nobody runs hundreds of CPU-heavy programs at once and wants them all to be fast. We run one program, and we want its result as soon as possible.

However, accelerating a single program is not easy: its dependency chains create a long, essentially serial critical path. Since we can't really increase clock speeds anymore, accelerating a mostly linear program means parallelising every part that can be parallelised. But that same linearity leaves very little opportunity to do so, maybe short batches of 5-10 instructions, or pure functions.

Short batches of independent instructions can be recognised by a CPU that decodes instructions faster than it can execute them: it can look ahead, spot causally independent computations, and execute them ahead of time. However, function calls act as a strong barrier to this ad-hoc ahead-of-time execution, because the CPU has no real way of knowing which side effects the function might have that would forbid the code after the call from executing early. For example, when we write sin(x) + cos(y), and both become function calls, the assembly might look roughly like this:

1 PUSH x
2 CALL sin
3 PUSH res ; save sin(x) for later
4 PUSH y
5 CALL cos
6 POP a ; restore sin(x)
7 ADD res, a ; res = cos(y) + sin(x)

To the CPU, it is by no means obvious what the call to sin in line 2 changes or leaves unchanged. It may change any memory location, it may change any register. And the CPU also does not know whether the call to cos in line 5 depends on any of those changes. Knowing this would require programming language and compiler support, plus special hints in the ISA that communicate the causal dependencies to the CPU. This means that ahead-of-time/out-of-order execution is fairly limited in how many of the common parallelisable workloads it can handle.
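As a rough compiler-level analogue of such hints, GCC and Clang already offer __attribute__((const)), which promises that a function has no side effects and reads no global state. The sketch below uses hypothetical helper names (my_sin, my_cos) purely for illustration; the point is that this kind of annotation carries exactly the dependency information the CPU itself never sees:

    #include <cmath>
    #include <cstdio>

    // Hypothetical helpers for illustration. The attribute (a GCC/Clang
    // extension) promises no side effects and no reads of global state,
    // so the compiler is free to reorder or overlap the two calls.
    __attribute__((const)) static double my_sin(double x) { return std::sin(x); }
    __attribute__((const)) static double my_cos(double x) { return std::cos(x); }

    int main() {
        double x = 0.5, y = 1.0;
        // Neither call can clobber state the other depends on, so their
        // evaluation order (or overlap) is unconstrained.
        std::printf("%f\n", my_sin(x) + my_cos(y));
        return 0;
    }

The proposal here goes one step further: it would bake comparable side-effect information into the ISA, so the hardware, not just the compiler, could exploit it.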

The next tool for speeding up programs would be to direct another core to execute code in a thread. Creating a thread is extremely expensive, though, and does not pay off until that thread's workload becomes fairly large. I am sure that much of the overhead comes from the fact that the OS gatekeeps access to the other CPU cores, but that is because it has already promised those cores to other processes. Something as simple as a square root should not be executed in a new thread, as the thread creation and synchronisation overhead would outweigh the parallelised workload many times over.
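A quick, non-rigorous way to see this imbalance (timings will vary by OS and hardware, and a real benchmark would need many iterations and warm-up) is to compare computing a square root inline against spawning a thread for it:

    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <thread>

    int main() {
        using clock = std::chrono::steady_clock;
        volatile double out = 0.0;

        // Spawn a thread for a single square root, then wait for it.
        auto t0 = clock::now();
        std::thread worker([&] { out = std::sqrt(2.0); });
        worker.join();
        auto t1 = clock::now();

        // Compute a square root inline on the current core.
        auto t2 = clock::now();
        out = std::sqrt(3.0);
        auto t3 = clock::now();

        auto ns = [](auto a, auto b) {
            return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
        };
        // Thread creation + join typically costs on the order of tens of
        // microseconds, while the sqrt itself takes nanoseconds.
        std::printf("threaded: %lld ns, inline: %lld ns\n",
                    (long long)ns(t0, t1), (long long)ns(t2, t3));
        return 0;
    }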

Thus, modern CPUs, especially in combination with modern OS and programming-language paradigms, fall far short of executing a single program as fast as possible. Instead of bolting more cores onto a CPU so it can run more programs at mediocre speeds, we should be maximising single-core performance. Only when a single program manages to completely bottleneck the available memory bandwidth for both data access and instruction fetching should we look for other things in CPU design.

Now, what do I propose for achieving that? The solution is quite simple, actually:

  1. Add a generic SIMD mode to the CPU ISA. In this mode, the processor should be able to compute the same operations on many different values at once. This SIMD mode must support the full programmability of the standard SISD mode, with the exception of conditional branches. This is strictly superior to running the same code in parallel on multiple cores, because that would require fetching the same instructions on many cores, wasting memory bandwidth and instruction cache space. It also benefits from much cheaper synchronisation than multi-core systems could ever achieve. A 3D vector dot product could look like this:

    1 SIMD 3                         ; enter SIMD mode with 3 lanes
    2 LOAD A, [.vector + SIMD_I * 4] ; each lane I loads vector[I]
    3 MUL A, A                       ; square each lane
    4 SISD                           ; return to scalar mode
    5 SUM A                          ; reduce the lanes to a single value
    

    Recursion during SIMD mode is somewhat tricky, but possible; the same goes for variable-length loops. (A rough software approximation of this dot product with today's tools is sketched after this list.)

  2. Add fork/join commands to the CPU ISA. These must be extremely efficient, with overheads no larger than 2-3 cycles, and should allow the CPU to spawn what I call a micro-thread: a short-lived parallel execution stream that executes a single function call and, upon returning, communicates the result register back to the caller. The CPU should contain multiple execution units that may be activated with a FORK instruction, which would act as a function call. The sin(x) + cos(y) example from above would look like this:

    1 PUSH x
    2 FORK0 sin ; call sin(x), parallel if possible
    3 PUSH y
    4 CALL cos ; call cos(y)
    5 JOIN0 a   ; await sin(x) to return, retrieve result
    6 ADD res, a
    

    Here, in line 2, we request the execution of sin in a parallel execution unit. If such a unit is available, the function is dispatched to it; otherwise, it is executed like a normal CALL. The CPU ISA should specify a maximum number of parallel execution units, because the CPU needs to store the result register of a FORK instruction separately from the normal stack, so that it can be retrieved via the corresponding JOIN, and the CPU must support as many concurrent forks as the ISA allows. Forking is slightly tricky, because the stack can no longer be contiguous, but this can be solved by introducing separate, pre-allocated stacks for all fork units. These are set up once at program startup.

    This is completely different from having multiple cores, because a fork unit is subordinate to its caller, while multi-core designs have no such hierarchy. The forked routine may not outlast its parent routine, so it does not allow true multi-threading, and thereby forces the unit to be used for speeding up a single program. (A rough software emulation of these FORK/JOIN semantics is also sketched after this list.)

  3. Add more ALUs (or just their most common parts, such as adders and comparators) and instruction decoding units, and make instructions efficiently batch-decodable. The more operands an ALU can consume per cycle, and the more instruction streams we can decode at once, the more effective the SIMD and fork units become. Although multiple instruction streams are good for filling up idle cycles in a single ALU, our goal is to exceed the maximum throughput of a single ALU multiple times over, so we need multiple ALUs. Of course, some expensive units may not be feasible to replicate often, such as multipliers, but these also suffer the least from contention, as they have longer pipelines, so relative latency increases are smaller and parallelism is higher. Modern CPUs can already dispatch several instructions at once and multiplex them over multiple ALUs, although not all operations can be processed at that rate (multiplications and divisions, for example). But we want to dispatch 4, 8, or even 16 instructions at once using fork units and regular ahead-of-time execution, so that we can have multiple ALUs with fully packed pipelines, especially in SIMD mode, which dispatches many operations per instruction. We want to fully utilise the memory and ALU bandwidth.
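Neither the SIMD mode nor the FORK/JOIN instructions exist in any ISA today, but their intended semantics can be approximated in software. The sketch below is only an analogy under that assumption: the loop in dot3 expresses the lane-wise computation of item 1 in a form a vectorising compiler can map onto existing SIMD units, and std::async stands in for FORK/JOIN from item 2, albeit with thread-level overhead instead of the 2-3 cycles the proposal calls for. All names are made up for the example.

    #include <cmath>
    #include <cstdio>
    #include <future>

    // Item 1, approximated: a fixed-length loop over the lanes. A vectorising
    // compiler can turn this into SIMD instructions, mirroring SIMD 3 / MUL / SUM.
    double dot3(const double a[3], const double b[3]) {
        double sum = 0.0;
        for (int i = 0; i < 3; ++i) // same operation applied to every lane
            sum += a[i] * b[i];
        return sum;
    }

    // Item 2, approximated: FORK0 sin -> launch sin(x) asynchronously,
    // CALL cos -> run cos(y) on the current execution stream,
    // JOIN0 -> wait for the forked result, then combine.
    double trig_sum(double x, double y) {
        auto forked = std::async(std::launch::async, [x] { return std::sin(x); });
        double c = std::cos(y);
        return c + forked.get();
    }

    int main() {
        const double v[3] = {1.0, 2.0, 3.0};
        std::printf("dot3(v, v) = %f\n", dot3(v, v)); // the v . v case from item 1
        std::printf("trig_sum   = %f\n", trig_sum(0.5, 1.0));
        return 0;
    }

The gap between what this emulation costs and the 2-3 cycle budget above is exactly why the proposal wants these primitives in hardware.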

With the correct design and tooling, it should be possible to achieve another big leap in single-program execution speeds. This could even make FPGA-based CPUs viable for everyday use: although they operate at lower clock speeds, they naturally tend toward long pipelines (as they execute less logic per cycle than ASICs at the same clock rate), which lend themselves well to massive parallelism. If programs are written with SIMD for-loops, these pipelines can achieve many times the throughput of conventional ASIC designs, at only minor relative increases in latency. And thanks to micro-threading, critical paths of operations are no longer that critical, because even small instruction streams can be fully parallelised with low overhead; so although we are slower linearly, we can harness much greater levels of parallelism from the same program than regular CPUs can. Once this approach is put into an ASIC, it would probably prove to be quite a bit superior to today's multi-core CPU designs.